{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "monitor_training.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/master/monitor_training.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hyyN-2qyK_T2",
        "colab_type": "text"
      },
      "source": [
        "# Stable Baselines, a Fork of OpenAI Baselines - Monitor Training and Plotting\n",
        "\n",
        "Github Repo: [https://github.com/hill-a/stable-baselines](https://github.com/hill-a/stable-baselines)\n",
        "\n",
        "Medium article: [https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82](https://medium.com/@araffin/stable-baselines-a-fork-of-openai-baselines-df87c4b2fc82)\n",
        "\n",
        "[RL Baselines Zoo](https://github.com/araffin/rl-baselines-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines.\n",
        "\n",
        "It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.\n",
        "\n",
        "Documentation is available online: [https://stable-baselines.readthedocs.io/](https://stable-baselines.readthedocs.io/)\n",
        "\n",
        "## Install Dependencies and Stable Baselines Using Pip\n",
        "\n",
        "List of full dependencies can be found in the [README](https://github.com/hill-a/stable-baselines).\n",
        "\n",
        "```\n",
        "sudo apt-get update && sudo apt-get install cmake libopenmpi-dev zlib1g-dev\n",
        "```\n",
        "\n",
        "\n",
        "```\n",
        "pip install stable-baselines[mpi]\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gWskDE2c9WoN",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Stable Baselines only supports tensorflow 1.x for now\n",
        "%tensorflow_version 1.x\n",
        "!apt install swig cmake libopenmpi-dev zlib1g-dev\n",
        "!pip install stable-baselines[mpi]==2.10.2 box2d box2d-kengz"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zSU6gAetAjy1",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import stable_baselines\n",
        "stable_baselines.__version__"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FtY8FhliLsGm",
        "colab_type": "text"
      },
      "source": [
        "## Import policy, RL agent, Wrappers"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "BIedd7Pz9sOs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import os\n",
        "\n",
        "import gym\n",
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "from stable_baselines import DDPG, TD3\n",
        "from stable_baselines.ddpg.policies import LnMlpPolicy\n",
        "from stable_baselines.bench import Monitor\n",
        "from stable_baselines.results_plotter import load_results, ts2xy\n",
        "from stable_baselines.common.noise import AdaptiveParamNoiseSpec, NormalActionNoise\n",
        "from stable_baselines.common.callbacks import BaseCallback"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RapkYvTXL7Cd",
        "colab_type": "text"
      },
      "source": [
        "## Define a Callback Function\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pUWGZp3i9wyf",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class SaveOnBestTrainingRewardCallback(BaseCallback):\n",
        "    \"\"\"\n",
        "    Callback for saving a model (the check is done every ``check_freq`` steps)\n",
        "    based on the training reward (in practice, we recommend using ``EvalCallback``).\n",
        "\n",
        "    :param check_freq: (int)\n",
        "    :param log_dir: (str) Path to the folder where the model will be saved.\n",
        "      It must contains the file created by the ``Monitor`` wrapper.\n",
        "    :param verbose: (int)\n",
        "    \"\"\"\n",
        "    def __init__(self, check_freq: int, log_dir: str, verbose=1):\n",
        "        super(SaveOnBestTrainingRewardCallback, self).__init__(verbose)\n",
        "        self.check_freq = check_freq\n",
        "        self.log_dir = log_dir\n",
        "        self.save_path = os.path.join(log_dir, 'best_model')\n",
        "        self.best_mean_reward = -np.inf\n",
        "\n",
        "    def _init_callback(self) -> None:\n",
        "        # Create folder if needed\n",
        "        if self.save_path is not None:\n",
        "            os.makedirs(self.save_path, exist_ok=True)\n",
        "\n",
        "    def _on_step(self) -> bool:\n",
        "        if self.n_calls % self.check_freq == 0:\n",
        "\n",
        "          # Retrieve training reward\n",
        "          x, y = ts2xy(load_results(self.log_dir), 'timesteps')\n",
        "          if len(x) > 0:\n",
        "              # Mean training reward over the last 100 episodes\n",
        "              mean_reward = np.mean(y[-100:])\n",
        "              if self.verbose > 0:\n",
        "                print(\"Num timesteps: {}\".format(self.num_timesteps))\n",
        "                print(\"Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}\".format(self.best_mean_reward, mean_reward))\n",
        "\n",
        "              # New best model, you could save the agent here\n",
        "              if mean_reward > self.best_mean_reward:\n",
        "                  self.best_mean_reward = mean_reward\n",
        "                  # Example for saving best model\n",
        "                  if self.verbose > 0:\n",
        "                    print(\"Saving new best model to {}\".format(self.save_path))\n",
        "                  self.model.save(self.save_path)\n",
        "\n",
        "        return True"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7c8VHsiXC7dL",
        "colab_type": "text"
      },
      "source": [
        "## Create and wrap the environment\n",
        "\n",
        "We will be using Lunar Lander environment with continuous actions"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kmxIq5UeC3Nj",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 53
        },
        "outputId": "42232cb2-57b3-4e1f-daa4-c12d060cf9a9"
      },
      "source": [
        "# Create log dir\n",
        "log_dir = \"/tmp/gym/\"\n",
        "os.makedirs(log_dir, exist_ok=True)\n",
        "\n",
        "# Create and wrap the environment\n",
        "env = gym.make('LunarLanderContinuous-v2')\n",
        "# Logs will be saved in log_dir/monitor.csv\n",
        "env = Monitor(env, log_dir)"
      ],
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: \u001b[33mWARN: Box bound precision lowered by casting to float32\u001b[0m\n",
            "  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "80OxZ_uMDd4J",
        "colab_type": "text"
      },
      "source": [
        "## Define and train the DDPG agent"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "iaOPfOrwWEP4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Add some param noise for exploration\n",
        "param_noise = AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.1)\n",
        "# Create the callback: check every 1000 steps\n",
        "callback = SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir)\n",
        "# Because we use parameter noise, we should use a MlpPolicy with layer normalization\n",
        "model = DDPG(LnMlpPolicy, env, param_noise=param_noise, verbose=0)\n",
        "# Train the agent\n",
        "model.learn(total_timesteps=int(1e5), callback=callback)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qQ4bxRQZDuk1",
        "colab_type": "text"
      },
      "source": [
        "## Plotting helpers\n",
        "\n",
        "Stable Baselines has some built-in plotting helper, that you can find in `stable_baselines.results_plotter`. However, to show how to do it yourself, we are going to use custom plotting functions. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "h_kMEHmJm3P3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from stable_baselines import results_plotter\n",
        "\n",
        "# Helper from the library\n",
        "results_plotter.plot_results([log_dir], 1e5, results_plotter.X_TIMESTEPS, \"DDPG LunarLander\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mPXYbV39DiCj",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def moving_average(values, window):\n",
        "    \"\"\"\n",
        "    Smooth values by doing a moving average\n",
        "    :param values: (numpy array)\n",
        "    :param window: (int)\n",
        "    :return: (numpy array)\n",
        "    \"\"\"\n",
        "    weights = np.repeat(1.0, window) / window\n",
        "    return np.convolve(values, weights, 'valid')\n",
        "\n",
        "\n",
        "def plot_results(log_folder, title='Learning Curve'):\n",
        "    \"\"\"\n",
        "    plot the results\n",
        "\n",
        "    :param log_folder: (str) the save location of the results to plot\n",
        "    :param title: (str) the title of the task to plot\n",
        "    \"\"\"\n",
        "    x, y = ts2xy(load_results(log_folder), 'timesteps')\n",
        "    y = moving_average(y, window=50)\n",
        "    # Truncate x\n",
        "    x = x[len(x) - len(y):]\n",
        "\n",
        "    fig = plt.figure(title)\n",
        "    plt.plot(x, y)\n",
        "    plt.xlabel('Number of Timesteps')\n",
        "    plt.ylabel('Rewards')\n",
        "    plt.title(title + \" Smoothed\")\n",
        "    plt.show()\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CQXx7HiSDt7_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "plot_results(log_dir)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "PQmsSZUHKNRG",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WalAw3xy6_KF",
        "colab_type": "text"
      },
      "source": [
        "## TD3 vs DDPG\n",
        "\n",
        "TD3 is the successor of DDPG (cf [Documentation](https://stable-baselines.readthedocs.io/))"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9mIfkTEe7BF6",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 53
        },
        "outputId": "e2071533-9571-471b-dd4b-884b11e001dc"
      },
      "source": [
        "# Create log dir\n",
        "log_dir = \"/tmp/gym/td3/\"\n",
        "os.makedirs(log_dir, exist_ok=True)\n",
        "\n",
        "# Create and wrap the environment\n",
        "env = gym.make('LunarLanderContinuous-v2')\n",
        "# Logs will be saved in log_dir/monitor.csv\n",
        "env = Monitor(env, log_dir)"
      ],
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: \u001b[33mWARN: Box bound precision lowered by casting to float32\u001b[0m\n",
            "  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "iUBiLbC77XP3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Create action noise because TD3 and DDPG use a deterministic policy\n",
        "n_actions = env.action_space.shape[-1]\n",
        "action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "X3rcf3kO79fR",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 73
        },
        "outputId": "53eb9d1b-284d-412f-c888-6d1c21b16e9c"
      },
      "source": [
        "policy_kwargs = dict(layers=[400, 300])\n",
        "model_td3 = TD3(\"MlpPolicy\", env, action_noise=action_noise, buffer_size=int(1e5), policy_kwargs=policy_kwargs, verbose=0)"
      ],
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/stable_baselines/td3/td3.py:195: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UaJSLzMy_2a4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Create the callback: check every 1000 steps\n",
        "callback = SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "C4BxFyzs7Opp",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Train the agent\n",
        "model_td3.learn(total_timesteps=int(1e5), callback=callback)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hLOnill58bK-",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "plot_results(log_dir)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "h-OLkiFa8hhz",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "results_plotter.plot_results([log_dir], 1e5, results_plotter.X_TIMESTEPS, \"TD3 LunarLander\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TeVOCJA9HvIX",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}
